A Maximum Entropy Approach to Kannada Part Of Speech Tagging

Authors

  • Ramakanth Kumar
  • Aniket Dalal
  • Kumar Nagaraj
  • Uma Sawant
  • Sandeep Shelke
  • Siva Reddy
  • Serge Sharoff
Abstract

Part-of-speech (POS) tagging is the most important pre-processing step in almost all Natural Language Processing (NLP) applications. It is the process of labelling each word in a text with its appropriate part of speech. This paper applies a Maximum Entropy model, a probabilistic classifier, to the tagging of Kannada sentences. Kannada is agglutinative and morphologically very rich, but resource poor. Hence 51,267 words from the EMILLE corpus were manually tagged and used as training data. The tagset comprised 25 tags as defined for Indian languages. The feature set best suited to the language was finalised after rigorous experiments. For testing, 2,892 word forms were downloaded from Kannada websites. The experiments yielded an accuracy of 81.6%, showing that Maximum Entropy is well suited to Kannada.
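
As a rough illustration of the technique named in the abstract, the sketch below trains a tiny Maximum Entropy (multinomial logistic) tagger by stochastic gradient ascent on the conditional log-likelihood. Everything in it is an assumption made for illustration: the feature templates in `features`, the toy English sentences, the three-tag tagset, and the greedy left-to-right decoding are hypothetical and are not the paper's feature set, data, 25-tag tagset, or training procedure.

```python
import math
from collections import defaultdict

def features(words, i, prev_tag):
    """Hypothetical context features for the word at position i."""
    w = words[i].lower()
    return [
        "bias",
        "word=" + w,
        "suffix2=" + w[-2:],   # short suffixes as a crude stand-in for morphology
        "suffix3=" + w[-3:],
        "prev_tag=" + prev_tag,
        "prev_word=" + (words[i - 1].lower() if i > 0 else "<s>"),
    ]

def tag_probs(weights, tags, feats):
    """MaxEnt distribution: p(tag | context) = exp(sum of active weights) / Z."""
    scores = {t: sum(weights[(f, t)] for f in feats) for t in tags}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {t: math.exp(s - top) for t, s in scores.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def train(tagged_sents, epochs=50, lr=0.5):
    """Stochastic gradient ascent on the conditional log-likelihood."""
    tags = sorted({t for sent in tagged_sents for _, t in sent})
    weights = defaultdict(float)
    for _ in range(epochs):
        for sent in tagged_sents:
            words = [w for w, _ in sent]
            prev = "<s>"
            for i, (_, gold) in enumerate(sent):
                feats = features(words, i, prev)
                probs = tag_probs(weights, tags, feats)
                for t in tags:
                    # gradient = observed feature count - expected feature count
                    grad = (1.0 if t == gold else 0.0) - probs[t]
                    for f in feats:
                        weights[(f, t)] += lr * grad
                prev = gold  # condition on the gold previous tag during training
    return weights, tags

def tag(weights, tags, words):
    """Greedy decoding: pick the most probable tag word by word."""
    out, prev = [], "<s>"
    for i in range(len(words)):
        probs = tag_probs(weights, tags, features(words, i, prev))
        prev = max(tags, key=lambda t: probs[t])
        out.append(prev)
    return out

# Toy training data, invented purely for this example.
TOY = [
    [("the", "DET"), ("dog", "NN"), ("runs", "VB")],
    [("a", "DET"), ("cat", "NN"), ("sleeps", "VB")],
    [("the", "DET"), ("cat", "NN"), ("runs", "VB")],
]
weights, tags = train(TOY)
print(tag(weights, tags, ["a", "dog", "sleeps"]))
```

For an agglutinative language such as Kannada, suffix-style features like the ones assumed here are one plausible way to expose morphological cues to the classifier; which features actually work best is what the paper's experiments set out to determine.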

Similar resources

A Two-Stage Approach to Chinese Part-of-Speech Tagging

This paper describes a Chinese part-of-speech tagging system based on the maximum entropy model. It presents a novel two-stage approach that uses the part-of-speech tags of the words on both sides of the current word. The system is evaluated on four corpora in the closed track of the Chinese part-of-speech tagging task at the Fourth SIGHAN Bakeoff.

Ning Ma et al.: Fusion of Word Clustering Features for Tibetan Part of Speech Tagging

Tibetan part-of-speech (POS) tagging, the foundation of Tibetan natural language processing, determines the class of each word from its contextual information. Working within the maximum entropy framework, the paper studies the fusion of morphological features for Tibetan part-of-speech tagging with word clustering features. Experimental result...

Probabilistic Part Of Speech Tagging for Bahasa Indonesia

In this paper we report our work on developing part-of-speech tagging for Bahasa Indonesia using probabilistic approaches. We use Conditional Random Fields (CRF) and Maximum Entropy methods to assign a tag to each word. We use two tagsets containing 37 and 25 part-of-speech tags for Bahasa Indonesia. In this work we compare both methods using two different corpora. The results of the ex...

Markov Random Field Based English Part-of-speech Tagging System

Probabilistic models have been widely used for natural language processing. Part-of-speech tagging, which assigns the most likely tag to each word in a given sentence, is one of the problems that can be solved by a statistical approach. Many researchers have tried to solve the problem with the hidden Markov model (HMM), one of the best-known statistical models. But it has many difficulties: ...

Maximum Entropy Part-of-Speech Tagging in NLTK

In this paper we implement a part of speech tagger for NLTK using maximum entropy methods. Our tagger can be used as a drop-in replacement for any of the other NLTK taggers. We give a brief tutorial on how to use our tagger as well as describing the implementation at a high level. We evaluate our tagger on the Penn Tree Bank and compare our results to those of previous work.

Publication date: 2012